ABSTRACT

Determining the most suitable indices to use in evaluating
empirical results is a matter of considerable debate among
researchers (Chow, 1988; Huberty, 1987; Kupfersmid, 1988; Rosnow
& Rosenthal, 1989; Thompson, 1989b). Researchers increasingly
recognize that significance tests are very limited in their
potential to inform the interpretation of scientific results
(Carver, 1978). Three strategies for augmenting the interpretation
of significance test results are illustrated.
Determining the most suitable indices to use in evaluating
empirical results is a matter of considerable debate among
researchers (Chow, 1988; Huberty, 1987; Kupfersmid, 1988; Rosnow
& Rosenthal, 1989; Thompson, 1989b). Researchers increasingly
recognize that significance tests are very limited in their
potential to inform the interpretation of scientific results
(Carver, 1978). Three strategies for augmenting the interpretation
of significance test results are illustrated here.

An Historical Perspective
However, it may be worthwhile to provide an historical
perspective on just how far researchers have come in recognizing
the potentials and the limits of statistical significance testing.
Consider first the position statement of Melton (1962, p. 554)
following 12 years of service as editor of the Journal of
Experimental Education:

In editing the Journal there has been a strong
reluctance to accept and publish results related to
the principal concern of the researcher when those
results were [only] significant at the .05 level.
It reflects a belief that it is the responsibility of
the investigator in a science to reveal his effect
in such a way that no reasonable man would be in a
position to discredit the results by saying that
they were the product of the way the ball bounces.

Consider in comparison a statement from one of the several (cf.
Kupfersmid, 1988; Meehl, 1978) articles published more recently in
prominent journals in psychology:

It may not be an exaggeration to say that for many
PhD students, for whom the .05 alpha has acquired an
almost ontological mystique, it can mean joy, a
doctoral degree, and a tenure-track position at a
major university if their dissertation p is less
than .05.... [But] surely, God loves the .06 nearly
as much as the .05 [level]. (Rosnow & Rosenthal,
1989, p. 1277)

Social science has certainly come a long way during the last few
years in recognizing the essential limits of significance tests!

Significance as a Test of Sample Size
Even some widely respected authors of prominent textbooks are
sometimes not quite sure what role significance tests should play
in analysis (Thompson, 1987a, 1988d), and some dissertation authors
too may be disproportionately susceptible to excessive awe for
significance tests (Eason & Daniel, 1989; Thompson, 1988b).
Researchers who have had the fortunate experience of working with
large samples (cf. Kaiser, 1976) soon realize that virtually all
null hypotheses will be rejected, since "the null hypothesis of no
difference is almost never exactly true in the population"
(Thompson, 1987b, p. 14). As Meehl (1978, p. 822) notes, "As I
believe is generally recognized by statisticians today and by
thoughtful social scientists, the null hypothesis, taken literally,
is always false." Thus Hays (1981, p. 293) argues that "virtually
any study can be made to show significant results if one uses
enough subjects." A concrete heuristic example may serve to
emphasize this point.

Presume that a researcher was working in the Houston school
district, and analyzed data involving some of the district's
200,000 students. Suppose the researcher decided to compare the
mean IQ scores of 12,000 students located in one zip code with the
mean IQ of the 188,000 remaining students residing in other zip
codes. Since the t distribution approaches the z distribution as
sample size approaches infinity, researchers use the z distribution
to test mean differences with large samples. These calculations
are reported in Table 1.



INSERT TABLE 1 ABOUT HERE. 



The mean IQ (100.15, SD=15) of the 12,000 students residing
in the zip code of interest differs to a statistically significant
degree (z calc = 2.12 > z crit = 1.96, p < .05) from the mean (99.85,
SD=15) of the remaining 188,000 students. The less thoughtful
researcher might suggest to school board members that special
schools for gifted students should be erected throughout the zip
code of the 12,000 students, since they are "significantly"
brighter than their compatriots.
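
The large-sample test just described can be retraced with a short Python sketch. It is only an illustration of the arithmetic reported in Table 1; the function and variable names are hypothetical, and the critical value of 1.96 is the two-tailed .05 value cited in the text.

```python
import math


def z_for_mean_difference(m1, sd1, n1, m2, sd2, n2):
    """Large-sample z statistic for the difference between two independent means."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (m1 - m2) / standard_error


# Values from the school district example (Table 1).
z_calc = z_for_mean_difference(100.15, 15, 12_000, 99.85, 15, 188_000)
z_crit = 1.96  # two-tailed critical value at alpha = .05

print(f"z calc = {z_calc:.3f}; statistically significant: {abs(z_calc) > z_crit}")
```

The result is statistically significant only because the two groups are so large; the difference being tested is three-tenths of one IQ point.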

Alternatively, the more thoughtful researcher in such a
situation would note that the standardized difference in these two
means (.3/15 = 0.02) is trivial. The difference of means (.3, or
roughly one-third of one IQ point) is also substantially smaller
than one standard error of measurement for an IQ measure with a
reliability coefficient of 0.92, i.e., SEM = SD*((1-r)**.5) = 4.24.
Such a thoughtful researcher would be reluctant to extrapolate
policy recommendations from every statistically significant result.
As Huberty (1987, p. 6) notes, "it would be well to have some idea
as to the approximate power (i.e., 1 - beta) one has for some
'important' or 'interesting' alternative hypothesis
characterizations, given a particular alpha."
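
The two benchmarks used by the more thoughtful researcher can be checked directly. The sketch below only restates the arithmetic in the paragraph above; the variable names are hypothetical.

```python
import math

mean_difference = 100.15 - 99.85  # difference between the two group means
sd = 15.0                         # standard deviation of the IQ measure
reliability = 0.92                # reliability coefficient assumed in the text

# Standardized difference: the mean difference expressed in SD units.
standardized_difference = mean_difference / sd

# Standard error of measurement: SEM = SD * sqrt(1 - r).
sem = sd * math.sqrt(1 - reliability)

print(f"standardized difference = {standardized_difference:.2f}")
print(f"SEM = {sem:.2f}")
```

The mean difference is far smaller than the SEM, so the "significant" difference is smaller than the measurement error expected in a single student's score.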

Morrison and Henkel (1970) and Carver (1978) provide
historically important and incisive explanations of the limits of
significance testing as an aid to interpretation. Although
significance is a function of at least seven interrelated features
of a study (Schneider & Darcy, 1984), sample size is the primary
influence on significance. To some extent significance tests
evaluate the size of the researcher's sample; most researchers
already know prior to conducting significance tests whether the
sample in hand is large or small, so these outcomes do not always
result in incisive insight that would be lost absent a significance
test.

Interpreting Significance Tests in a Sample Size Context
The first strategy for augmenting interpretation of
significance tests involves evaluating significance test results
in a sample size context. The researcher is encouraged to determine
at what smaller sample size a statistically significant fixed
effect size would no longer be significant, or conversely, at what
larger sample size a nonsignificant result would become
statistically significant (Thompson, 1989a).



Table 2 illustrates this application. The table presents
significance tests associated with varying sample sizes and large
(33.6%) fixed effect sizes. The table can be viewed as presenting
results for either a multiple regression analysis involving two
predictor variables (in which case the "r sq" effect size would be
called the squared multiple correlation coefficient, R**2) or an
analysis of variance involving an omnibus test of differences in
three means in a one-way design (in which case the "r sq" effect
size would be called the correlation ratio or eta squared).



INSERT TABLE 2 ABOUT HERE. 

The table presents results for fixed effect sizes but
increasing sample sizes (4, 13, 23, or 33). For the 33.6% effect
size reported in Table 2, the result becomes statistically
significant when there are somewhere between 13 and 23 subjects in
the analysis.
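
The logic of Table 2 can be sketched in Python as follows. The sums of squares and the tabled critical F values are copied from Table 2 rather than computed; the function and variable names are hypothetical.

```python
def f_calc(sos_explained, sos_unexplained, df_explained, df_error):
    """Calculated F: mean square explained divided by mean square error."""
    return (sos_explained / df_explained) / (sos_unexplained / df_error)


# Fixed sums of squares from Table 2 (effect size = 337.2 / 1002.3 = 33.6%).
SOS_EXPLAINED = 337.2
SOS_UNEXPLAINED = 665.1
DF_EXPLAINED = 2

# Sample size -> tabled critical F at alpha = .05, as reported in Table 2.
TABLED_F_CRIT = {4: 200.00, 13: 4.10, 23: 3.49, 33: 3.32}

results = {}
for n, f_crit in TABLED_F_CRIT.items():
    df_error = n - 1 - DF_EXPLAINED
    f = f_calc(SOS_EXPLAINED, SOS_UNEXPLAINED, DF_EXPLAINED, df_error)
    results[n] = (f, f > f_crit)
    decision = "Rej" if f > f_crit else "Not Rej"
    print(f"n = {n:2d}  df error = {df_error:2d}  F calc = {f:6.3f}  "
          f"F crit = {f_crit:6.2f}  {decision}")
```

Only the error degrees of freedom change from row to row, yet the decision about the null hypothesis reverses between 13 and 23 subjects.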

The researcher who does not genuinely understand statistical 
significance would differentially interpret the effect size of 
33.6% when there were 13 versus 23 subjects in the analysis. Yet 
the effect sizes within the table are fixed. Empirical studies of 
research practice indicate that superficial understanding of 
significance testing has actually led to serious distortions such 
as researchers interpreting significant results involving small 
effect sizes while ignoring nonsignificant results involving large 
effect sizes (Craig, Eison & Metze, 1976)!

Interpreting Effect Size as an Index of Result Importance
Many effect size estimates (e.g., Hays, 1981; Tatsuoka, 1973)
are available for researchers who wish to garner some insight
regarding result importance. The simplest effect sizes are
analogous to the coefficient of determination (r**2). For example,
in analysis of variance the sum of squares for an effect can be
divided by the SOS total to compute the correlation ratio (also
called eta squared). Such statistics inform the researcher
regarding what proportion of variance in the dependent variable(s)
is explained by a given predictor. The simplest effect sizes are
based on the data in hand and sample size is not considered as part
of the calculations.
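
As a minimal sketch of this calculation, the Python fragment below computes eta squared in two ways: from the sums of squares reported in Table 2, and from raw scores in a one-way design. The three small groups of scores are invented purely for illustration, and the function names are hypothetical.

```python
def eta_squared(sos_effect, sos_total):
    """Correlation ratio: proportion of total variance explained by the effect."""
    return sos_effect / sos_total


def eta_squared_one_way(groups):
    """Eta squared for a one-way design, computed from raw scores."""
    scores = [x for group in groups for x in group]
    grand_mean = sum(scores) / len(scores)
    sos_total = sum((x - grand_mean) ** 2 for x in scores)
    sos_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    return eta_squared(sos_between, sos_total)


# Sums of squares reported in Table 2.
table_2_effect = eta_squared(337.2, 1002.3)

# Invented scores for three groups (illustration only, not from the paper).
toy_effect = eta_squared_one_way([[1, 2, 3], [2, 3, 4], [4, 5, 6]])

print(f"Table 2 effect size = {table_2_effect:.3f}")
print(f"toy example effect size = {toy_effect:.3f}")
```

Neither calculation refers to sample size, which is why these simple effect sizes need the corrections discussed next.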

However, all classical parametric methods are correlational
(Knapp, 1978; Thompson, 1988a) and do capitalize on sampling error
as part of least squares analyses. Thus, the simpler effect sizes
overestimate both the effect size in the full population and the
effect size likely to be realized in future studies. Correction
formulas (Maxwell, Camp & Arvey, 1981; Rosnow & Rosenthal, 1988)
can be applied to estimate population effect sizes based on sample
results (e.g., Wherry, 1931), or to estimate the effect size
estimates likely in future samples (Herzberg, 1969).

Corrections tend to be larger as either effect sizes or
sample sizes become smaller, as illustrated by Thompson (in press).
Thus, with a very large effect size or a large sample size, or
both, it will matter less which, if any, corrections the researcher
applies in estimating effect sizes.
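
The pattern can be illustrated with one widely used shrinkage formula, adjusted R**2 = 1 - (1 - R**2)(n - 1)/(n - k - 1), where k is the number of predictors. This form is often associated with Wherry's and Ezekiel's work; treating it as the exact correction applied in the sources cited above is an assumption, and the function names are hypothetical.

```python
def adjusted_r_squared(r_squared, n, n_predictors):
    """Shrinkage-corrected R**2: 1 - (1 - R**2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - n_predictors - 1)


def shrinkage(r_squared, n, n_predictors):
    """Amount by which the sample effect size is corrected downward."""
    return r_squared - adjusted_r_squared(r_squared, n, n_predictors)


# Table 2 effect size (33.6%, two predictors) at the larger Table 2 sample sizes.
for n in (13, 23, 33):
    print(f"R sq = .336, n = {n}: adjusted = {adjusted_r_squared(0.336, n, 2):.3f}")

# A much larger effect size at the same small sample size.
print(f"R sq = .900, n = 13: adjusted = {adjusted_r_squared(0.900, 13, 2):.3f}")
```

With two predictors, the correction to the 33.6% effect shrinks as n grows from 13 to 33, and a 90% effect is barely corrected even at n = 13.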

Cohen's (1988) perusal of published research suggests that a
correlation ratio of around 25% (r = .5) should be considered large
in terms of typical findings across disciplines. The empirical
meta-analytic work of Glass and others, which has yielded some
additional ways of evaluating effect size, has also led to similar
conclusions:

In none of the dozen or so research literatures that
we have integrated in the past five years have we
ever encountered a cross-validated multiple
correlation between study findings and study
characteristics that was larger than approximately
0.60. That is, I haven't seen a body of literature
in which we can account for much more than a third
of the variability in the results of studies, [which
is distinct from talking about results for only one
smaller group of subjects]. (Glass, 1979, p. 13)

Interpreting Results Based on Likelihood of Replication
A third strategy emphasizes interpretation based on estimated
likelihood that results will replicate. This emphasis is compatible
with the basic purpose of science: isolating conclusions that
replicate under stated conditions. Notwithstanding some
misconceptions to the contrary, significance tests do not evaluate
the probability that results will generalize.

The simplest methods for evaluating replicability partition
the sample and then empirically compare results across sample
splits. Various sample partitioning methods include conventional
cross-validation strategies and also the jackknife methods
developed by Tukey and his colleagues (cf. Crask & Perreault, 1977;
Daniel, 1989).

The cross-validation methods involve randomly splitting the
sample into two subsets, conducting separate analyses, and then
empirically comparing the results. Table 3 presents data for a
multiple regression example involving two variables ("P" and "R")
used to predict the dependent variable ("DV"). The first three
subjects were assigned to the first invariance subgroup
("INV" = "1"), while the last four subjects were purportedly randomly
assigned to the second invariance group. Appendix A presents the
SPSS-X commands used to conduct the empirical invariance analysis
for these data.



INSERT TABLE 3 ABOUT HERE. 



The invariance statistics are produced by the CORRELATIONS 
procedure. Table 4 presents the invariance results. For this very
small data set, the results are not replicable across subsamples. 
The researcher hopes that invariance coefficients will approach 
one. Negative values would be very disturbing indeed. But when 
results appear to be replicable, the researcher can interpret the 
set of results involving all the subjects with more confidence. The 
results for all the subjects are always used as the basis for 
interpretation, since the full sample should theoretically provide 
the most generalizable results; sample splitting is only performed 
to evaluate the replicability of the results. 
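
The invariance coefficients in Table 4 can be retraced with the following Python sketch. It uses the predictor scores from Table 3 and the standardized weights that appear in the Appendix A commands; the function and variable names are hypothetical, and the sketch is an illustration rather than a replacement for the SPSS-X runs.

```python
import math
from statistics import mean, stdev


def z_scores(values):
    """Standardize scores within a subgroup (sample SD, n - 1 denominator)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def pearson_r(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def predicted(zp, zr, betas):
    """Predicted scores from standardized predictors and a pair of beta weights."""
    beta_p, beta_r = betas
    return [beta_p * p + beta_r * r for p, r in zip(zp, zr)]


def invariance_coefficient(group):
    """Correlate predictions from group 1 betas with predictions from group 2 betas."""
    zp, zr = z_scores(group["P"]), z_scores(group["R"])
    return pearson_r(predicted(zp, zr, BETAS_1), predicted(zp, zr, BETAS_2))


# Predictor scores from Table 3, split by invariance group (INV).
GROUP_1 = {"P": [1, 2, 3], "R": [3, 6, 1]}
GROUP_2 = {"P": [4, 5, 6, 7], "R": [8, 4, 0, 9]}

# Standardized weights copied from the Appendix A commands.
BETAS_1 = (-0.371189, -1.087694)  # derived in subgroup 1
BETAS_2 = (0.83549, 0.286434)     # derived in subgroup 2

invariance_1 = invariance_coefficient(GROUP_1)
invariance_2 = invariance_coefficient(GROUP_2)
print(f"group 1 invariance coefficient = {invariance_1:.4f}")
print(f"group 2 invariance coefficient = {invariance_2:.4f}")
```

Both coefficients are negative, which is the disturbing outcome described above: the weights derived in one subgroup order the subjects of the other subgroup in nearly the opposite way.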
INSERT TABLE 4 ABOUT HERE.



It is very important that result replicability be investigated 
empirically rather than by subjectively comparing solutions for 
subsamples. Results can appear very different but actually yield
comparable effect sizes. Such cases involve what Cliff (1987, pp.
177-178) refers to as the "sensitivity" of prediction weights. 

The most powerful strategy for evaluating result replicability
invokes the "bootstrap" methods developed by Efron and his
colleagues (cf. Diaconis & Efron, 1983; Efron, 1979; Lunneborg, in
press). Conceptually, these methods involve copying the data set
over again and again many, many times into a large "mega" data set.
Then dozens (or hundreds or thousands) of different samples are 
drawn from the "mega" file, and results are computed separately for 
each sample and then averaged. The method is powerful because the 
analysis considers so many configurations of subjects and informs 
the researcher regarding the extent to which results generalize 
across different configurations of subjects. Lunneborg (1987) has 
offered some excellent computer programs that automate this logic 
for univariate applications; Thompson (1988c) provides similar 
software for multivariate applications. 

Table 5 presents a small data set that can be used to 
illustrate "bootstrap" estimation. Table 6 presents descriptive 
statistics for the data in hand. Table 7 presents "bootstrap" 
estimates of population correlation coefficients based on the Table 
5 data. These estimates were developed using the software available 
from Lunneborg (1987), and were based on 500 samples with
replacement.
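
The resampling logic can be sketched in Python for a single correlation coefficient, here the SYSTOLAV and TOTCHOL scores from Table 5. This is not the Lunneborg (1987) software: it draws resamples with replacement directly from the data rather than building a "mega" file, the function names and the seed are hypothetical choices, and its estimates will not match Table 7 exactly because the random draws differ.

```python
import math
import random
from statistics import mean, median, stdev

# Scores for the 12 subjects in Table 5.
SYSTOLAV = [94.0, 108.7, 97.7, 90.3, 100.7, 104.3,
            95.3, 97.7, 102.7, 96.0, 107.0, 108.0]
TOTCHOL = [180, 142, 165, 199, 187, 148, 164, 174, 190, 161, 159, 159]


def pearson_r(x, y):
    """Pearson r, or None when a resample has no variance on a variable."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def bootstrap_r(x, y, n_resamples=500, seed=0):
    """Correlations from resamples of subjects drawn with replacement."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_resamples):
        drawn = [rng.randrange(n) for _ in range(n)]
        r = pearson_r([x[i] for i in drawn], [y[i] for i in drawn])
        if r is not None:
            estimates.append(r)
    return estimates


sample_r = pearson_r(SYSTOLAV, TOTCHOL)
boot = bootstrap_r(SYSTOLAV, TOTCHOL)

print(f"sample r         = {sample_r:.3f}")
print(f"bootstrap mean   = {mean(boot):.4f}")
print(f"bootstrap median = {median(boot):.4f}")
print(f"bootstrap SD     = {stdev(boot):.4f}")
```

The bootstrap SD indicates how much the coefficient moves across different configurations of the same 12 subjects; a large SD relative to the estimate suggests the result may not replicate.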



INSERT TABLES 5, 6 AND 7 ABOUT HERE. 

As Thompson (1989b, p. 4) notes, "significance, importance,
and replicability are all important issues in research. Too many
researchers attend only to issues of significance in their
research. [Yet] in some respects, statistical significance may be the
least important element of this research triumvirate." The
interpretation of empirical results should augment the
interpretation of significance tests with (a) interpretation of
significance tests in a sample size context; (b) interpretation of
effect sizes; and (c) interpretation based on estimated likelihood
that results will replicate. These applications were illustrated
with small heuristic data sets to make the discussion more
concrete.
Table 1

Test of Mean Differences for School District Example

z calc = (M1 - M2) / (((SD1**2 / n1) + (SD2**2 / n2)) ** .5)
       = (100.15 - 99.85) / (((15**2 / 12000) + (15**2 / 188000)) ** .5)
       = 0.3 / (((225 / 12000) + (225 / 188000)) ** .5)
       = 0.3 / ((0.01875 + 0.001196) ** .5)
       = 0.3 / (0.019946808 ** .5)
       = 0.3 / 0.141233170
       = 2.124146887

Note. From Thompson (1990), with permission.



Table 2

Statistical Significance at Various Sample Sizes
for a Fixed Effect Size (Large Effect Size)

Source        SOS    r sq   df       MS   F calc   F crit   Decision
SOSexp      337.2   0.336    2  168.600    0.253   200.00   Not Rej
SOSunexp    665.1            1  665.100
SOStot     1002.3            3  334.100

SOSexp      337.2   0.336    2  168.600              4.10   Not Rej
SOSunexp    665.1           10   66.510
SOStot     1002.3           12   83.525

SOSexp      337.2   0.336    2  168.600              3.49   Rej
SOSunexp    665.1           20   33.255
SOStot     1002.3           22   45.559

SOSexp      337.2   0.336    2  168.600              3.32   Rej
SOSunexp    665.1           30   22.170
SOStot     1002.3           32   31.322

Note. As sample size increases, tabled "critical F" values get
smaller. Also, as sample size increases, error df gets
larger, mean square error gets smaller, and thus "calculated F"
gets larger. ... for the purposes of
these analyses. From Thompson (1989b), with permission.
Table 3

Observed and Latent Variables for Small Example Case

P  R  DV  INV     ZP1     ZR1     ZP2     ZR2  YHAT11  YHAT12  YHAT21  YHAT22
1  3  90    1  -1.000   -.132       .       .    .515   -.873
2  6  49    1    .000   1.060       .       .  -1.152    .304
3  1  93    1   1.000   -.927       .       .    .637    .570
4  8  20    2       .       .  -1.162    .669                   -.296   -.779
5  4   3    2       .       .   -.387   -.304                    .474   -.411
6  0  39    2       .       .    .387  -1.276                   1.245   -.042
7  9  63    2       .       .   1.162    .912                  -1.423   1.232

Note. From Thompson (1989b), with permission.



Table 4

Invariance Statistics

               DV     YHAT11     YHAT12     YHAT21
YHAT11     1.0000 a
            (n=3)
YHAT12     -.2842 b  -.2843 c
            (n=3)     (n=3)
YHAT21     -.5182 b
            (n=4)
YHAT22      .8747 a     .          .       -.5924 c
            (n=4)                           (n=4)

Note. From Thompson (1989b), with permission.

a The multiple correlation coefficient (R) for the invariance
  group.
b The "shrunken R" for the invariance group.
c The invariance coefficient for the invariance group.
Table 5

Data Set for Heuristic Example

 n   MILESEC        SYSTOLAV       POND          TOTCHOL      HDLCHOL
 1    890 (+0.18)    94.0 (-1.04)  11.5 (-0.91)  180 (+0.64)  80.1 (+1.17)
 2   1097 (+1.16)   108.7 (+1.42)  12.0 (-0.69)  142 (-1.56)  51.1 (-1.02)
 3   1300 (+2.12)    97.7 (-0.42)  13.1 (-0.21)  165 (-0.23)  63.3 (-0.10)
 4    948 (+0.45)    90.3 (-1.66)  12.6 (-0.43)  199 (+1.74)  75.7 (+0.84)
 5    940 (+0.41)   100.7 (+0.08)  19.3 (+2.49)  187 (+1.04)  61.0 (-0.27)
 6    760 (-0.44)   104.3 (+0.69)  14.7 (+0.48)  148 (-1.22)  76.0 (+0.86)
 7    740 (-0.53)    95.3 (-0.82)  14.2 (+0.26)  164 (-0.29)  78.5 (+1.05)
 8    571 (-1.33)    97.7 (-0.42)  13.6 (+0.00)  174 (+0.29)  54.3 (-0.78)
 9    748 (-0.50)   102.7 (+0.42)  10.9 (-1.17)  190 (+1.22)  62.2 (-0.18)
10    640 (-1.01)    96.0 (-0.70)  11.4 (-0.95)  161 (-0.46)  67.4 (+0.21)
11    642 (-1.00)   107.0 (+1.14)  14.6 (+0.44)  159 (-0.58)  34.8 (-2.25)
12    957 (+0.49)   108.0 (+1.30)  15.2 (+0.70)  159 (-0.58)  70.8 (+0.47)

Note. From Thompson (1990), with permission.



Table 6

Descriptive Statistics and Correlation Coefficients

           MILESEC  SYSTOLAV    POND  TOTCHOL  HDLCHOL  PREDC1  CRITC1
Mean         852.8     100.2    13.6    169.0     64.6     0.0     0.0
SD           211.0       6.0     2.3     17.3     13.2     1.0     1.0

MILESEC                 .052    .046    -.047     .140    .063    .048
SYSTOLAV                        .244    -.624    -.559   -.981   -.752
POND                                     .008    -.121   -.084   -.064
TOTCHOL                                           .243    .637    .569
HDLCHOL                                                   .830    .742
PREDC1                                                            .767

Note. From Thompson (1990), with permission.
Table 7

Bootstrap Estimates of r's for Table 5 Data
Based on 500 Samples with Replacement

         Table 6    Bootstrap   Bootstrap   Bootstrap
Coef.    Estimate   Mean        Median      SD
  1       0.052      0.0514      0.0417     0.2819
  2       0.046      0.0421      0.0603     0.2287
  3      -0.047     -0.0233     -0.0489     0.2690
  4       0.140      0.1135      0.1551     0.3092
  5       0.244      0.2598      0.2428     0.2343
  6      -0.624     -0.5878     -0.6196     0.2135
  7      -0.559     -0.5430     -0.5737     0.2166
  8       0.008     -0.0649     -0.0519     0.3541
  9      -0.121     -0.0971     -0.1198     0.2224
 10       0.243      0.2189      0.2486     0.2560

Note. From Thompson (1990), with permission.
APPENDIX A: Example SPSS-X Commands for Table 3 Data

TITLE 'Demo of Regression Invariance Procedure***'
FILE HANDLE BT/NAME='DEMO7301.DAT'
DATA LIST FILE=BT/P R 1-2 DV 3-4 INV 5
if (inv eq 1) zp1=(p-2.0)/1.0
if (inv eq 1) zr1=(r-3.333)/2.517
if (inv eq 2) zp2=(p-5.5)/1.291
if (inv eq 2) zr2=(r-5.25)/4.113
if (inv eq 1) yhat11=(-.371189*zp1)+(-1.087694*zr1)
if (inv eq 1) yhat12=(.83549*zp1)+(.286434*zr1)
if (inv eq 2) yhat21=(-.371189*zp2)+(-1.087694*zr2)
if (inv eq 2) yhat22=(.83549*zp2)+(.286434*zr2)
variable labels yhat11 'group 1 data using group 1 betas'
  yhat12 'group 1 data using group 2 betas'
  yhat21 'group 2 data using group 1 betas'
  yhat22 'group 2 data using group 2 betas'
print formats zp1 to yhat22 (F8.5)
list variables=p to yhat22
SUBTITLE 'REGRESSION USING ALL DATA'
REGRESSION VARIABLES=P R DV/DESCRIPTIVES=ALL/DEPENDENT=DV/
  ENTER P R
TEMPORARY
SELECT IF (INV EQ 1)
SUBTITLE 'REGRESSION FOR SUBGROUP #1'
REGRESSION VARIABLES=P R DV/DESCRIPTIVES=ALL/DEPENDENT=DV/
  ENTER P R
TEMPORARY
SELECT IF (INV EQ 2)
SUBTITLE 'REGRESSION FOR SUBGROUP #2'
REGRESSION VARIABLES=P R DV/DESCRIPTIVES=ALL/DEPENDENT=DV/
  ENTER P R
subtitle 'check z calculations'
condescriptive zp1 to yhat22
statistics all
subtitle 'invariance results #######'
correlations variables=dv yhat11 to yhat22/statistics=all

Note. The analysis requires two runs. The first run excludes the cards
in lower case and is conducted to derive the numerical values
required for the lower case cards, which are added for the second run.






